This repository contains exploratory data analysis of the Perfume Dataset using R.
The global fragrance industry is both highly competitive and deeply shaped by cultural and consumer preferences. Beyond aesthetics, the market reflects evolving trends in gender identity, lifestyle choices, and purchasing behaviors. For brands, understanding these dynamics is critical in designing product portfolios, targeting marketing campaigns, and identifying opportunities for innovation.
In this report, we analyze a curated dataset of perfumes covering multiple dimensions, including brand, type, category, target audience, and longevity. Our objective is to uncover patterns that reveal how different factors interact and shape consumer preferences.
readr, dplyr, tidyr, stringr, janitor, ggplot2, ggrepel, scales, plotly, tibble, vcd
# Define a function that installs a package if it is missing, then loads it
install_if_missing <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, dependencies = TRUE, repos = "https://cloud.r-project.org/")
    library(pkg, character.only = TRUE)
  }
}
# Install (if needed) and load the libraries
pkgs <- c("readr", "dplyr", "tidyr", "stringr", "janitor", "ggplot2", "ggrepel", "scales", "plotly", "tibble", "vcd")
invisible(lapply(pkgs, install_if_missing))
data/ → raw dataset and cleaned dataset
scripts/ → R scripts for data cleaning, analysis, visualization, modeling
notebooks/ → R Markdown for step-by-step analysis
outputs/ → figures, reports
docs/ → research notes, methodology
1. Market share of men’s and women’s fragrances
2. Number of perfumes under each brand
3. Market share of each category and type
4. Gender preference by category and type
5. Does type or category influence longevity?
ggplot2
# Read in the CSV
perfume <- read.csv("Data/Perfumes_dataset.csv")
# Standardise column names and text values
perfume <- perfume |>
  janitor::clean_names() |>
  dplyr::mutate(
    brand = stringr::str_squish(brand),
    perfume = stringr::str_squish(perfume),
    type = stringr::str_squish(stringr::str_to_lower(type)), # e.g. "edp", "edt"
    category = stringr::str_squish(stringr::str_to_title(category)), # "Fresh Scent" etc.
    target_audience = stringr::str_squish(stringr::str_to_title(target_audience)), # "Male/Female/Unisex"
    longevity = stringr::str_squish(stringr::str_to_title(longevity)) # "Strong/Medium/..."
  )
perfume[1:10,]
## brand perfume type category target_audience longevity
## 1 dumont nitro red edp Fresh Scent Male Strong
## 2 dumont nitro pour homme edp Fresh Scent Male Strong
## 3 dumont nitro white edp Fresh Scent Unisex Strong
## 4 dumont nitro blue edp Fresh Scent Unisex Strong
## 5 dumont nitro green edp Fresh Scent Unisex Strong
## 6 dumont nitro platinum edp Mass Pleaser Male Strong
## 7 dumont nitro intense edp Woody Spicy Male Strong
## 8 dumont nitro black edp Woody Spicy Male Strong
## 9 dumont celerio oros edp Oriental Vanilla Unisex Medium
## 10 dumont celerio epic edp Woody Aromatic Male Medium
glimpse(perfume)
## Rows: 1,004
## Columns: 6
## $ brand <chr> "dumont", "dumont", "dumont", "dumont", "dumont", "dum…
## $ perfume <chr> "nitro red", "nitro pour homme", "nitro white", "nitro…
## $ type <chr> "edp", "edp", "edp", "edp", "edp", "edp", "edp", "edp"…
## $ category <chr> "Fresh Scent", "Fresh Scent", "Fresh Scent", "Fresh Sc…
## $ target_audience <chr> "Male", "Male", "Unisex", "Unisex", "Unisex", "Male", …
## $ longevity <chr> "Strong", "Strong", "Strong", "Strong", "Strong", "Str…
summary(perfume)
## brand perfume type category
## Length:1004 Length:1004 Length:1004 Length:1004
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## target_audience longevity
## Length:1004 Length:1004
## Class :character Class :character
## Mode :character Mode :character
brand – The company or label that produces the perfume (e.g., Dumont).
perfume – The name of the fragrance (e.g., Nitro Red).
type – Concentration or formulation of the perfume (e.g., EDP – Eau de Parfum).
category – Classification of the fragrance based on scent family or style (e.g., Fresh Scent, Woody Spicy, Oriental Vanilla).
target_audience – The intended wearer of the perfume (e.g., Male, Female, Unisex).
longevity – Expected performance in terms of duration on the skin (e.g., Strong, Medium).
Nitro Red (Dumont, EDP) – A fresh scent designed for men with strong longevity.
Celerio Oros (Dumont, EDP) – An oriental vanilla fragrance suitable for unisex wearers with medium longevity.
Nitro Black (Dumont, EDP) – A woody spicy perfume for men with strong performance.
# Count the unique values in each column
sapply(perfume, function(x) length(unique(x)))
## brand perfume type category target_audience
## 55 940 11 157 7
## longevity
## 13
brand_counts <- perfume |>
  dplyr::count(brand, name = "n") |>
  dplyr::arrange(dplyr::desc(n))
print(head(brand_counts, 20)) # top 20 brands
## brand n
## 1 Jean Paul Gaultier 94
## 2 paris corner 76
## 3 armaf 70
## 4 fragrance world 42
## 5 Al Haramain 37
## 6 Azzaro 35
## 7 Lattafa 33
## 8 Afnan 30
## 9 Dior 29
## 10 Maison Alhambra 25
## 11 Creed 24
## 12 Victoria's Secret 24
## 13 Louis Vuitton 23
## 14 Hermès 22
## 15 Ajmal 21
## 16 Prada 20
## 17 Carolina Herrera 19
## 18 Dolce & Gabbana 19
## 19 xerjoff 18
## 20 Parfums de Marly 17
# To list every brand, call print(brand_counts) directly
top10 <- brand_counts[1:10, ]
p_bar <- ggplot(top10, aes(x = brand, y = n, fill = brand)) +
  geom_col(width = 0.7, color = "white") +
  geom_text(aes(label = n), hjust = 1.02, size = 3.8) + # count label inside the bar, at its right end
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, .05))) +
  scale_fill_brewer(palette = "Paired") +
  labs(
    title = "Top 10 Brands by Number of Perfumes",
    x = "Brand", y = "Count", fill = "Brand"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none",
        panel.grid.major.y = element_blank())
print(p_bar)
# ------------- B) Pie Chart (Top 10) -------------
# ---- Top 10 brand data ----
top10 <- perfume %>%
  count(brand, name = "n") %>%
  arrange(desc(n)) %>%
  slice_max(n, n = 10) %>%
  mutate(share = n / sum(n),
         label = percent(share, accuracy = 0.1))
# ---- Draw a basic pie chart ----
ggplot(top10, aes(x = 1, y = share, fill = brand)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  # position_stack() centres each label in its own slice, whatever the stacking order
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Market Share of Top 10 Perfume Brands",
    fill = "Brand"
  ) +
  theme_void(base_size = 14) +
  theme(legend.position = "right")
Our analysis of the top 10 perfume brands highlights the competitive landscape in terms of product portfolio size:
Jean Paul Gaultier dominates
With 94 perfumes, Jean Paul Gaultier has the largest portfolio in the dataset by a clear margin.
This is the largest single-brand share (19.2%) among the top 10 and points to a strong focus on product variety.
Strong challengers in the mid-range
Paris Corner (77 perfumes, 15.7%) and Armaf (70 perfumes, 14.3%) follow closely, together holding nearly one-third of the market within the top 10 brands.
These brands have built large and diverse portfolios, indicating aggressive strategies in product expansion.
Other notable players
Al Haramain (43, 8.6%), Fragrance World (42, 8.6%), and Lattafa (36, 7.1%) form a competitive mid-tier.
Traditional luxury brands like Giorgio Armani (30, 6.1%), Hugo Boss (33, 6.7%), and Azzaro (35, 7.3%) maintain stable presence but with smaller portfolios relative to the leaders.
📊 Key Insights
Jean Paul Gaultier, Paris Corner, and Armaf collectively account for almost 50% of the top 10 market share, making them the clear leaders in terms of product variety.
Luxury houses (Armani, Hugo Boss, Azzaro) have comparatively smaller portfolios, but they may rely more on brand equity and premium positioning than sheer volume.
Emerging and Middle Eastern brands (Lattafa, Al Haramain, Fragrance World) are significant players, reflecting the globalization of perfume markets and the growing importance of niche/affordable luxury brands.
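The brand table printed earlier mixes capitalisation styles ("paris corner", "armaf" next to "Jean Paul Gaultier"), and the cleaning step does not change the case of `brand`. If the same brand also appears under more than one spelling, its count would be split across rows. The sketch below shows one way to fold case and whitespace before counting; the labels are hypothetical, and whether the real column contains such variants is an assumption that should be checked against the data.

```r
# Hypothetical brand labels; the real column may or may not contain case variants
brand <- c("Paris Corner", "paris corner", "Armaf", "armaf", "ARMAF ", "Dior")

# Fold case and whitespace so that spelling variants share one key
brand_key <- tolower(trimws(gsub("\\s+", " ", brand)))

# Count perfumes per normalised brand, largest first
brand_counts_norm <- sort(table(brand_key), decreasing = TRUE)

# Share of each brand within this toy sample
brand_share <- brand_counts_norm / sum(brand_counts_norm)

print(brand_counts_norm)
print(round(brand_share, 3))
```

Counting on the normalised key while keeping the original column for display avoids changing how brand names are printed.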
gender_in_category <- perfume |>
  dplyr::count(category, target_audience, name = "n") |>
  dplyr::group_by(category) |>
  dplyr::mutate(share_within_category = n / sum(n)) |>
  dplyr::arrange(category, dplyr::desc(share_within_category)) |>
  dplyr::ungroup()
print(head(gender_in_category, 30))
## # A tibble: 30 × 4
## category target_audience n share_within_category
## <chr> <chr> <int> <dbl>
## 1 Amber Female 1 0.5
## 2 Amber Unisex 1 0.5
## 3 Amber Floral Unisex 31 0.775
## 4 Amber Floral Female 9 0.225
## 5 Amber Fougere Male 1 1
## 6 Amber Fougère Unisex 1 1
## 7 Amber Leather Unisex 1 1
## 8 Amber Musk Unisex 2 1
## 9 Amber Oriental Unisex 2 1
## 10 Amber Oud Unisex 2 1
## # ℹ 20 more rows
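The output above lists "Amber Fougere" and "Amber Fougère" as separate categories, so accent variants are inflating the category count. A minimal sketch of how such near-duplicates could be detected follows; `fold_accents()` is a hypothetical helper that covers only three accented letters, and the labels are illustrative.

```r
# Hypothetical category labels, including the accent variant seen in the output
category <- c("Amber Fougere", "Amber Foug\u00e8re", "Amber Floral", "Amber Floral")

# Replace a few accented letters with plain ASCII (extend the vectors as needed)
fold_accents <- function(x) {
  from <- c("\u00e8", "\u00e9", "\u00ea")
  to   <- c("e", "e", "e")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

category_key <- fold_accents(category)

# Keys that map back to more than one original spelling are likely duplicates
variants <- tapply(category, category_key, function(v) length(unique(v)))
print(variants[variants > 1])
```

Any key reported here could then be merged in the cleaning step before the shares are computed.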
gender_in_type <- perfume |>
  dplyr::count(type, target_audience, name = "n") |>
  dplyr::group_by(type) |>
  dplyr::mutate(share_within_type = n / sum(n)) |>
  dplyr::arrange(type, dplyr::desc(share_within_type)) |>
  dplyr::ungroup()
print(gender_in_type)
## # A tibble: 18 × 4
## type target_audience n share_within_type
## <chr> <chr> <int> <dbl>
## 1 alcohol-free Unisex 1 1
## 2 attar Unisex 1 1
## 3 cologne Female 6 0.545
## 4 cologne Unisex 5 0.455
## 5 concentrate Unisex 2 1
## 6 edp Unisex 311 0.452
## 7 edp Female 224 0.326
## 8 edp Male 153 0.222
## 9 edt Female 69 0.527
## 10 edt Male 35 0.267
## 11 edt Unisex 27 0.206
## 12 extrait Unisex 4 1
## 13 extrait de parfum Unisex 16 0.941
## 14 extrait de parfum Female 1 0.0588
## 15 oil Unisex 3 1
## 16 parfum Male 20 0.541
## 17 parfum Female 12 0.324
## 18 parfum Unisex 5 0.135
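As a cross-check, the within-type shares can be reproduced in base R from the counts printed above. The sketch below re-enters those 18 rows by hand (so it does not need the CSV), rebuilds the contingency table with `xtabs()`, and uses `prop.table()` for the row-wise shares. The overall audience share at the end is an extra derived from the same counts, not a figure reported in the output.

```r
# Counts copied from the type x target_audience table printed above
counts <- data.frame(
  type = c("alcohol-free", "attar", "cologne", "cologne", "concentrate",
           "edp", "edp", "edp", "edt", "edt", "edt", "extrait",
           "extrait de parfum", "extrait de parfum", "oil",
           "parfum", "parfum", "parfum"),
  target_audience = c("Unisex", "Unisex", "Female", "Unisex", "Unisex",
                      "Unisex", "Female", "Male", "Female", "Male", "Unisex", "Unisex",
                      "Unisex", "Female", "Unisex",
                      "Male", "Female", "Unisex"),
  n = c(1, 1, 6, 5, 2, 311, 224, 153, 69, 35, 27, 4, 16, 1, 3, 20, 12, 5)
)

# Contingency table: rows are types, columns are audiences
tab <- xtabs(n ~ type + target_audience, data = counts)

# Share of each audience within a type (each row sums to 1)
share_within_type <- prop.table(tab, margin = 1)

# Share of each audience across all types
audience_share <- colSums(tab) / sum(tab)

print(round(share_within_type["edp", ], 3))
print(round(audience_share, 3))
```

The `edp` row reproduces the 0.452 / 0.326 / 0.222 split shown in the printed table.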
# ====== A) Gender × Category ======
tab_cat <- table(perfume$target_audience, perfume$category)
# Chi-squared test
chi_cat <- chisq.test(tab_cat)
print(chi_cat)
##
## Pearson's Chi-squared test
##
## data: tab_cat
## X-squared = 1137.3, df = 288, p-value < 2.2e-16
# Cramer's V
cramer_v_cat <- sqrt(chi_cat$statistic / (sum(tab_cat) * (min(dim(tab_cat)) - 1)))
cat("Cramer's V (Gender × Category):", cramer_v_cat, "\n")
## Cramer's V (Gender × Category): 0.7970959
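The reported Cramér's V can be checked by hand from the printed statistic. The sketch below assumes N = 895 (the sum of the `n` column in the type × target_audience table) and three audience levels, so that min(r, c) − 1 = 2. Both values are inferences from the printed output rather than numbers stated in the analysis. They also imply the test ran on fewer rows than the 1,004 loaded and on fewer than the 7 audience values counted earlier, which suggests a filtering step that is not shown.

```r
# Values copied from the chi-squared output above
chi_sq <- 1137.3   # X-squared
df     <- 288      # degrees of freedom

# Assumptions (inferred, not stated in the source):
#   n_obs = 895 rows entered the table; the smaller dimension has 3 levels
n_obs <- 895
k_min <- 3 - 1

# Cramer's V = sqrt(chi-squared / (N * (min(r, c) - 1)))
cramers_v_check <- sqrt(chi_sq / (n_obs * k_min))

# With 3 audience levels, df = 2 * (categories - 1), which gives the category count
n_categories <- df / k_min + 1

print(round(cramers_v_check, 3))
print(n_categories)
```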
# Convert the residual matrix to a long table
resid_cat <- as.data.frame(as.table(chi_cat$residuals))
colnames(resid_cat) <- c("Gender", "Category", "Residual")
# Top 20 residuals by absolute value
top20_resid_cat <- resid_cat %>%
  arrange(desc(abs(Residual))) %>%
  slice_head(n = 20)
# Visualise: residual bar chart
ggplot(top20_resid_cat, aes(x = reorder(paste(Category, Gender, sep = " - "), abs(Residual)),
                            y = Residual, fill = Residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
                    labels = c("FALSE" = "Under-represented", "TRUE" = "Over-represented")) +
  labs(title = "Top 20 Residuals: Gender × Category",
       x = "Category - Gender", y = "Pearson Residual", fill = "Interpretation") +
  theme_minimal(base_size = 13)
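The residual chart plots Pearson residuals, which measure how far each observed cell count lies from the count expected under independence. A small sketch with made-up counts (not taken from the dataset) shows the calculation and confirms it matches what `chisq.test()` returns in `$residuals`.

```r
# Toy 2 x 3 contingency table (hypothetical counts, not from the dataset)
tab <- matrix(c(30, 10, 20,
                10, 30, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Female", "Male"),
                              category = c("Floral", "Woody", "Fresh")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residual: (observed - expected) / sqrt(expected)
resid_manual <- (tab - expected) / sqrt(expected)

# The same quantity as reported by chisq.test()
test_result   <- chisq.test(tab)
resid_builtin <- test_result$residuals

print(round(resid_manual, 2))
```

A positive residual marks an over-represented cell and a negative one an under-represented cell; the squared residuals sum to the chi-squared statistic.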
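# ====== Helper function: Cramér’s V ======
cramers_v <- function(tbl) {
  # Warnings about small expected counts are suppressed here
  chisq <- suppressWarnings(chisq.test(tbl))
  chi2 <- unname(chisq$statistic)
  n <- sum(tbl)
  n_row <- nrow(tbl)
  n_col <- ncol(tbl) # named n_col to avoid shadowing base::c()
  V <- sqrt(chi2 / (n * min(n_row - 1, n_col - 1)))
  list(
    chisq_test = chisq,
    cramer_v = V
  )
}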
# ====== A) Type × Longevity ======
tab_type_long <- table(perfume$type, perfume$longevity)
res_type_long <- cramers_v(tab_type_long)
cat("\n== Q5: Type × Longevity ==\n")
##
## == Q5: Type × Longevity ==
print(res_type_long$chisq_test)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 770.07, df = 99, p-value < 2.2e-16
cat(sprintf("Cramer's V: %.3f\n", res_type_long$cramer_v))
## Cramer's V: 0.309
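Because `cramers_v()` suppresses warnings, any notice that the chi-squared approximation is unreliable (typical when many cells have small expected counts) goes unseen. One way to check the p-value in that situation is the Monte Carlo option of `chisq.test()`. The sketch below uses a made-up sparse table; the counts, the seed, and the choice of `B = 5000` replicates are all illustrative assumptions.

```r
set.seed(42)  # make the simulated p-value reproducible

# Toy sparse table (hypothetical counts): several expected counts fall below 5
tab <- matrix(c(12, 1, 0,
                 2, 9, 1,
                 0, 1, 7),
              nrow = 3, byrow = TRUE,
              dimnames = list(type = c("edp", "edt", "parfum"),
                              longevity = c("Strong", "Medium", "Light")))

# Asymptotic test; the small-count warning is silenced as in cramers_v()
asymptotic <- suppressWarnings(chisq.test(tab))

# Monte Carlo p-value based on B simulated tables with the same margins
simulated <- chisq.test(tab, simulate.p.value = TRUE, B = 5000)

# Cramer's V depends only on the statistic, which is the same in both runs
n <- sum(tab)
v <- sqrt(unname(simulated$statistic) / (n * (min(dim(tab)) - 1)))

cat(sprintf("asymptotic p = %.4g, simulated p = %.4g, V = %.3f\n",
            asymptotic$p.value, simulated$p.value, v))
```

If the two p-values lead to the same conclusion, the asymptotic result can be reported with more confidence.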
# Residual analysis
resid_type <- as.data.frame(as.table(res_type_long$chisq_test$residuals))
colnames(resid_type) <- c("type", "longevity", "residual")
# Top 20 residuals by absolute value
top20_type <- resid_type %>%
  arrange(desc(abs(residual))) %>%
  slice_head(n = 20)
ggplot(top20_type, aes(x = reorder(paste(type, longevity, sep = " - "), abs(residual)),
                       y = residual, fill = residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
                    labels = c("FALSE" = "Under-represented", "TRUE" = "Over-represented")) +
  labs(
    title = "Top 20 Residuals: Type × Longevity",
    x = "Type - Longevity",
    y = "Pearson Residual",
    fill = "Interpretation"
  ) +
  theme_minimal(base_size = 13)
# ====== B) Category × Longevity ======
tab_cat_long <- table(perfume$category, perfume$longevity)
res_cat_long <- cramers_v(tab_cat_long)
cat("\n== Q5: Category × Longevity ==\n")
##
## == Q5: Category × Longevity ==
print(res_cat_long$chisq_test)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 4818.4, df = 1584, p-value < 2.2e-16
cat(sprintf("Cramer's V: %.3f\n", res_cat_long$cramer_v))
## Cramer's V: 0.700
# Residual analysis
resid_cat <- as.data.frame(as.table(res_cat_long$chisq_test$residuals))
colnames(resid_cat) <- c("category", "longevity", "residual")
# Top 20 residuals by absolute value
top20_cat <- resid_cat %>%
  arrange(desc(abs(residual))) %>%
  slice_head(n = 20)
ggplot(top20_cat, aes(x = reorder(paste(category, longevity, sep = " - "), abs(residual)),
                      y = residual, fill = residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  # Legend labels added to match the other residual charts
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
                    labels = c("FALSE" = "Under-represented", "TRUE" = "Over-represented")) +
  labs(
    title = "Top 20 Residuals: Category × Longevity",
    x = "Category - Longevity",
    y = "Pearson Residual",
    fill = "Interpretation"
  ) +
  theme_minimal(base_size = 13)
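With df = 1584, the category × longevity table has more cells than the dataset has rows, so most expected counts are very small and Cramér's V is likely inflated. One optional remedy is to lump rare categories into an "Other" level before testing. This is a sketch rather than part of the original analysis: `lump_rare()` is a hypothetical helper, and the threshold of 3 is an arbitrary assumption.

```r
# Hypothetical category column: two common levels and several singletons
category <- c(rep("Woody Spicy", 6), rep("Floral", 5),
              "Amber Oud", "Amber Musk", "Amber Leather")

# Replace levels that occur fewer than min_n times with "Other"
lump_rare <- function(x, min_n = 3) {
  freq <- table(x)
  rare <- names(freq)[freq < min_n]
  ifelse(x %in% rare, "Other", x)
}

category_lumped <- lump_rare(category, min_n = 3)

print(table(category_lumped))
```

Rerunning the chi-squared test on the lumped column would show whether the association persists once the sparse cells are pooled.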
Our analysis focused on the relationship between fragrance type/category and longevity.
Chi-square tests show a statistically significant association in both cases (p < 2.2e-16), moderate for type (Cramér's V = 0.309) and strong for category (Cramér's V = 0.700), and the residual analysis reveals clear directional patterns.
Long-Lasting Fragrances (Very Strong / Strong)
Extrait de Parfum (high-concentration perfumes) is heavily over-represented in the Very Strong longevity group. This aligns perfectly with product positioning: higher concentrations naturally lead to longer-lasting scents.
Woody, Oud, and Rose categories are also strongly over-represented in the Strong group, showing that these ingredients are typically linked with longer longevity.
Conclusion: High-concentration formats combined with deep woody/rose-based notes are the typical market choice for long-lasting perfumes.
Lighter Longevity (Light / Medium)
Eau de Toilette (EDT) is strongly over-represented in the Light group and severely under-represented in the Strong group.
Similarly, fresh and light floral categories are under-represented in the Medium group, suggesting they are positioned for shorter, lighter wear.
Conclusion: Lighter concentrations and fresher scent profiles naturally lean toward shorter-lasting usage.
Under-Represented Segments
Many Floral and Oriental Floral fragrances are under-represented in the Medium group. This suggests a “polarized” pattern: they are either formulated as light, fleeting perfumes or pushed directly into strong, long-lasting territory.
Conclusion: Certain categories show a two-pole distribution, rarely occupying the middle ground.
📊 Key Insights
Type and category do influence longevity, and the findings are consistent with fragrance industry intuition:
Higher concentration + heavier notes → longer-lasting scents.
Lower concentration + fresher notes → lighter, shorter-lasting scents.
Business implications:
For markets demanding long-lasting performance, brands should prioritize Extrait de Parfum formats and emphasize Woody / Oud / Rose compositions.
For everyday, casual consumers, the focus should be on EDT / fresh scents.
This analysis bridges consumer expectations with product design decisions, helping brands position products more strategically.
This analysis of the perfume dataset provides a structured view of how the fragrance market is shaped by audience preferences, brand strategies, product categories, and technical attributes such as type and longevity. From Q1 through Q5, several key insights emerge:
Unisex fragrances are no longer niche (Q1). With over one-third of the market, unisex perfumes have surpassed both male- and female-targeted products, reflecting a broad cultural shift toward inclusivity and flexibility in personal expression.
A few brands dominate through large product portfolios (Q2). Jean Paul Gaultier, Paris Corner, and Armaf together account for nearly half of the top 10 market share. Traditional luxury brands remain influential but compete more on brand equity than on sheer variety.
Woody, spicy, and floral–oriental blends define the mainstream market (Q3). Categories such as Woody Spicy and Floriental capture the largest shares, while Eau de Parfum (EDP) is the overwhelmingly dominant type. Niche categories like fresh or aquatic scents remain underrepresented, yet may offer opportunities for differentiation.
Gender preferences are statistically significant and structured (Q4). Chi-square tests confirm strong associations: floral and fruity categories are over-represented among female products, woody and spicy categories dominate male lines, while some blends (e.g., Amber Woody) successfully bridge into unisex markets. This highlights both the persistence of traditional preferences and areas of convergence.
Longevity is shaped by structural choices (Q5). Higher-concentration types such as Extrait de Parfum, along with Woody, Oud, and Rose categories, are over-represented among strong or very strong perfumes, while Eau de Toilette leans toward light longevity. This suggests that product design choices directly influence consumer perception of durability and value.
📊 Strategic Takeaways
Invest in unisex product lines: Demand for gender-neutral fragrances has become mainstream.
Differentiate within dominant categories: The woody and floral–oriental spaces are crowded; innovation is required to stand out.
Balance portfolio strategy: Brands can win either through scale (broad product ranges) or through premium positioning with smaller but iconic collections.
Leverage longevity as a value driver: Positioning long-lasting perfumes within competitive categories may strengthen consumer trust and pricing power.